
feat: estimate cycles#17

Open
not-matthias wants to merge 8 commits into master from feat/cycle-estimation

@not-matthias not-matthias commented Jun 9, 2026

Member

No description provided.

@codspeed-hq

codspeed-hq Bot commented Jun 9, 2026


Merging this PR will improve performance by ×2.3

⚡ 9 improved benchmarks
❌ 2 regressed benchmarks
✅ 29 untouched benchmarks
⏩ 80 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| test_valgrind[valgrind-3.26.0, python3 testdata/test.py, full-no-inline] | 6.8 s | 16.2 s | -57.94% |
| test_valgrind[valgrind-3.25.1, python3 testdata/test.py, full-with-inline] | 6.9 s | 8.7 s | -20.69% |
| test_valgrind[valgrind-3.25.1, echo Hello, World!, full-with-inline] | 612,627.1 ms | 734.8 ms | ×830 |
| test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, full-no-inline] | 5.3 s | 3.2 s | +65.08% |
| test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, full-no-inline] | 5.3 s | 3.2 s | +64.34% |
| test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, full-with-inline] | 5.5 s | 3.4 s | +61.54% |
| test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, full-with-inline] | 5.6 s | 3.5 s | +61.15% |
| test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, no-inline] | 3.1 s | 2 s | +51.99% |
| test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, no-inline] | 3.1 s | 2 s | +51.31% |
| test_valgrind[valgrind-3.25.1, stress-ng --cpu 4 --cpu-ops 10, inline] | 3.3 s | 2.2 s | +47.3% |
| test_valgrind[valgrind-3.26.0, stress-ng --cpu 4 --cpu-ops 10, inline] | 3.3 s | 2.2 s | +46.97% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing feat/cycle-estimation (2bc9a1c) with master (fa9ee2e)


Footnotes

  1. 80 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@not-matthias not-matthias force-pushed the feat/cycle-estimation branch from 4b2e2b7 to 22e0934 Compare June 11, 2026 17:56
@not-matthias not-matthias marked this pull request as ready for review June 12, 2026 06:49
@greptile-apps

greptile-apps Bot commented Jun 12, 2026


Greptile Summary

This PR adds per-instruction cycle estimation to Callgrind by integrating Capstone for real-time instruction decoding and generated LUT files (x86_caps_lut.inc, arm64_caps_lut.inc) that map packed instruction signatures to throughput and latency centi-cycle costs (Ct/Cl). It is enabled via a new --cycle-estimation=yes flag and wires the costs into the existing event-group framework alongside the cache and branch simulators.

  • New decoder pipeline: cycledecode_capstone.c opens a Capstone handle at post_clo_init time with all libc calls forwarded to Valgrind's freestanding coregrind libc; cycledecode.c performs a two-level LUT lookup (exact signature → width-agnostic → per-instruction default → 1.00-cycle flat fallback) and accumulates per-BB running sums for O(1) inclusive-cost updates at side exits.
  • Build system: Capstone is now a mandatory build dependency detected by configure.ac; all CI, release, and debian/rules paths switch to --enable-only64bit to avoid compiling the 32-bit secondary target without Capstone. A new composite GitHub Action builds the static Capstone and exports CAPSTONE_DIR.
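
The fallback chain described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `LutRow` layout, the `WIDTH_MASK` bit position, the `lookup_cost` helper, and the demo table values are all assumptions made for the example. Only the lookup order (exact signature, width-agnostic signature, per-instruction default, flat 100 centi-cycles) is taken from the summary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row layout; the real .inc tables may differ. */
typedef struct {
    uint32_t insn; /* instruction id */
    uint64_t sig;  /* packed signature, 0 = per-instruction default */
    uint16_t cy;   /* reciprocal throughput (Ct), centi-cycles */
    uint16_t cl;   /* latency (Cl), centi-cycles */
} LutRow;

typedef struct { uint32_t ct, cl; } Cost;

#define FALLBACK_CENTI_CYCLES 100u /* flat 1.00 cycle */
#define WIDTH_MASK 0xFFull         /* assumption: low byte encodes operand width */

/* Demo table, sorted by (insn, sig) so it can be binary-searched. */
static const LutRow demo_lut[] = {
    { 1, 0x0000, 120, 100 }, /* insn 1: per-instruction default */
    { 1, 0x1100,  25, 100 }, /* insn 1: width-agnostic form */
    { 1, 0x1140,  50, 100 }, /* insn 1: exact form */
    { 2, 0x0000, 300, 400 }, /* insn 2: per-instruction default */
};

/* Binary search for an exact (insn, sig) match; NULL on miss. */
static const LutRow *row_for(const LutRow *lut, size_t n,
                             uint32_t insn, uint64_t sig)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const LutRow *r = &lut[mid];
        if (r->insn < insn || (r->insn == insn && r->sig < sig))
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && lut[lo].insn == insn && lut[lo].sig == sig)
        return &lut[lo];
    return NULL;
}

/* Exact -> width-agnostic -> per-instruction default -> flat fallback. */
static Cost lookup_cost(const LutRow *lut, size_t n,
                        uint32_t insn, uint64_t sig)
{
    Cost c = { FALLBACK_CENTI_CYCLES, FALLBACK_CENTI_CYCLES };
    const LutRow *r = row_for(lut, n, insn, sig);
    if (!r) r = row_for(lut, n, insn, sig & ~WIDTH_MASK);
    if (!r) r = row_for(lut, n, insn, 0);
    if (r) { c.ct = r->cy; c.cl = r->cl; }
    return c;
}
```

Keeping every level in one sorted table means each fallback step is just another binary search with a coarser key, so no separate data structure is needed per level.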

Confidence Score: 5/5

Safe to merge; both findings are non-blocking edge cases that do not affect the common execution path.

The core decode-and-lookup pipeline, cost accumulation in cachesim_add_icost and setup_bbcc, and the build system wiring are all correct. Both findings are narrow edge cases that do not affect normal amd64/arm64 operation.

callgrind/main.c (zero-length IMark handling) and callgrind/cycledecode_capstone.h (i386 mode guard)

Important Files Changed

| Filename | Overview |
|---|---|
| callgrind/main.c | Adds cycle cost decode at IMark time and running-sum compute at BB finalisation; contains a redundant ternary for len that passes VG_MIN_INSTR_SZB instead of 0 to Capstone when VEX reports an undecodable instruction. |
| callgrind/cycledecode_capstone.h | New file: arch selection + Capstone handle API; CS_MODE_64 is used for both x86_64 and i386 hosts, which would silently decode 32-bit instructions incorrectly on an i386 build. |
| callgrind/cycledecode.c | New file: per-instruction cycle cost lookup using Capstone + binary-searched LUT; logic is correct, with two-level fallback (width-agnostic retry, then per-instruction default). |
| callgrind/cycledecode_capstone.c | New file: Capstone bridge + libc shims for nodefaultlibs Valgrind tool; shims follow coregrind conventions. |
| callgrind/sim.c | Registers EG_CYCLES event group and wires per-instruction Ct/Cl cost accumulation into cachesim_add_icost; follows the existing conditional-register / unconditional-add-to-full pattern for EG_ALLOC and EG_SYS. |
| callgrind/bbcc.c | Adds inclusive cycle cost (ct_incl/cl_incl) updates at side-exit handling for both skipped and non-skipped paths; guarded by cycle_estimation flag. |
| callgrind/global.h | Adds cycle_estimation CLI flag, four UInt cycle-cost fields to InstrInfo, and the EG_CYCLES=9 event group constant. |
| callgrind/Makefile.am | Adds cycledecode.c and cycledecode_capstone.c to CALLGRIND_SOURCES_COMMON and CAPSTONE_CFLAGS/LIBS to the primary target only; secondary target is avoided by --enable-only64bit in all build paths. |
| configure.ac | Adds mandatory Capstone detection (--with-capstone or $CAPSTONE_DIR); errors clearly if missing. |
| bench/generate_config.py | Extends CONFIGS with requires_codspeed flag; adds CODSPEED_VERSION constant and should_skip guard so --cycle-estimation configs are omitted for upstream Valgrind builds. |
| .github/actions/build-capstone/action.yml | New composite action: builds a static Capstone (x86+arm64 only, no stack-protector/fortify) and exports CAPSTONE_DIR to subsequent steps. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Ist_IMark during CLG_instrument] -->|cycle_estimation=yes, !seen_before| B[clg_cycle_cost bytes from cia len]
    B --> C{cs_disasm_iter Capstone}
    C -->|decode ok| D[compute_sig arch-specific]
    D --> E{row_for exact sig}
    E -->|hit| H[ct_cost = row->cy cl_cost = row->cl]
    E -->|miss| F{row_for width-agnostic}
    F -->|hit| H
    F -->|miss| G{row_for sig==0 default}
    G -->|hit| H
    G -->|miss| I[fallback: 100 centi-cycles]
    C -->|decode fail| I
    H --> J[curr_inode->ct_cost / cl_cost]
    I --> J
    J --> K[BB finalise: compute ct_incl/cl_incl running sums]
    K --> L[Runtime: cachesim_add_icost cost += exe_count x ct_cost]
    K --> M[setup_bbcc side exit cost += ct_incl / cl_incl]

Reviews (10): Last reviewed commit: "wip: use latest runner to fix samply"

Comment thread callgrind/cycledecode.c Outdated
Comment thread callgrind/main.c
Comment thread bench/generate_config.py Outdated
Comment thread callgrind/cycledecode.c Outdated
@not-matthias not-matthias force-pushed the feat/cycle-estimation branch 3 times, most recently from 650e97b to 86ac213 Compare June 18, 2026 13:17

@GuillaumeLagrange GuillaumeLagrange left a comment


olgtm, do we have internal documentation on how we generated the LUT?

+ for curiosity: why do we need capstone? It could be made a bit clearer.
My understanding is that it's used to transform the instruction's operation to derive the ID for the LUT?

Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/codspeed.yml Outdated
@not-matthias

Member Author

> olgtm, do we have internal documentation on how we generated the LUT?

Yes, see the AvalancheHQ/valgrind-helpers repository.

> + for curiosity: why do we need capstone? It could be made a bit clearer. My understanding is that it's used to transform the instruction's operation to derive the ID for the LUT?

We need to build a LUT entry for each instruction, but we can't just key on the raw bytes, as some instructions carry 64-bit immediate params. So we have to extract the parts that identify an instruction. Intel's XED decoder has IFORM, which would be super helpful here, as it identifies each instruction by its category. For example, XED_IFORM_MOV_MEMb_GPR8_DEFINED is mov [reg], reg.

We can't use XED as it only supports x86_64 and not ARM, which is why we manually reconstruct something similar with Capstone.
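
To make the idea concrete, here is a rough sketch of what "extract the parts that identify an instruction" could look like. It does not call Capstone and is not the PR's compute_sig: the `DecodedInsn` and `Operand` structs stand in for whatever a decoder reports, and the bit layout in `pack_sig` is invented for illustration. The point is only that the immediate value never enters the key, while the instruction id, operand kinds, and operand widths do.

```c
#include <stdint.h>

/* Hypothetical decoder output; a real decoder exposes similar details. */
enum OpKind { OP_NONE = 0, OP_REG = 1, OP_MEM = 2, OP_IMM = 3 };

typedef struct {
    enum OpKind kind;
    uint8_t size_bytes; /* operand width in bytes */
    int64_t imm;        /* immediate value, deliberately left out of the key */
} Operand;

typedef struct {
    uint16_t id;        /* instruction id (e.g. "mov") */
    uint8_t op_count;   /* number of valid entries in ops, at most 4 */
    Operand ops[4];
} DecodedInsn;

/* Pack id + per-operand (kind, width) into one 64-bit key.
 * Layout (assumed): 16 bits of id, then 4 slots of 2-bit kind + 8-bit width. */
static uint64_t pack_sig(const DecodedInsn *in)
{
    uint64_t sig = in->id;
    for (unsigned i = 0; i < 4; i++) {
        uint64_t kind = 0, size = 0;
        if (i < in->op_count) {
            kind = (uint64_t)in->ops[i].kind;
            size = in->ops[i].size_bytes;
        }
        sig = (sig << 10) | (kind << 8) | size;
    }
    return sig;
}
```

With this shape, `mov rax, 1` and `mov rax, 0x1122334455667788` map to the same key, while `mov [reg], reg` gets a different one, which is the IFORM-like grouping described above.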

Add the regenerated x86_caps_lut.inc / arm64_caps_lut.inc cost tables
consumed by the --cycle-estimation runtime.

- x86: Zen4-tuned reciprocal-throughput table.
- arm64: measured Cortex-A72 table, with a hand-frozen guide supplement for
  ops that are not benchmarked.
…-bit Capstone

The amd64 host builds both the primary (amd64) tool and a 32-bit x86
secondary tool, but Capstone is only built 64-bit, so CLG_WITH_CAPSTONE
is set only for the primary build. The secondary build compiled
cycledecode.c without it and tripped the mandatory-Capstone #error.

CodSpeed only ever runs the 64-bit tool, so build 64-bit only everywhere:
add --enable-only64bit to the CI configure, the release deb (debian/rules,
now unconditional), and the Justfile, and drop the now-unneeded
gcc-multilib / libc6-dev-i386 deps. This also roughly halves build time by
skipping the entire 32-bit toolchain.

Callgrind's cycle estimation links a static Capstone decoder. Add a
build step to ci, codspeed and release workflows that compiles Capstone
5.0.9 x86+arm64 only (other printers reference libc symbols the
-nodefaultlibs tool does not shim) and without stack-protector/fortify
(the tool runs without glibc's %fs TLS), then exports its prefix as
CAPSTONE_DIR for configure to pick up. Add cmake to the apt deps and
forward CAPSTONE_DIR through debuild -e in the release build.
… estimation

configure.ac gains --with-capstone=PATH (defaulting to $CAPSTONE_DIR)
and makes a static Capstone mandatory for the native tool, compiling the
decoder with fortify disabled since it links -nodefaultlibs. Makefile.am
adds the cycledecode sources/headers, ships the LUT .inc tables, and
passes the Capstone CFLAGS/LIBS. debian/rules forwards CAPSTONE_DIR to
configure via --with-capstone.

Decode the real guest bytes of each instruction (via Capstone) at first
translation and look up reciprocal-throughput (Ct) and latency (Cl)
estimates in the cost table. Register an EG_CYCLES event group exposing
Ct/Cl, accumulate self cost in the cache simulator and running inclusive
sums per BB so the call-graph cost at each side exit is an O(1) lookup.

Falls back to a flat 1.00 cycle (with a warning) on decode failure or no
table match, and disables itself if Capstone is unavailable for the guest.
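
As a rough illustration of the "running inclusive sums per BB" idea in the commit message above: once each instruction has a self cost, a single pass at BB finalisation can store the prefix sum next to it, so leaving the block at any instruction only needs one read. The `InstrCost` struct and the `finalise_bb` / `cost_at_exit` helpers below are simplified stand-ins invented for this sketch (the field names merely echo ct_cost / ct_incl from the review summary); Cl would be handled the same way.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified per-instruction record, in centi-cycles. */
typedef struct {
    uint32_t ct_cost; /* self cost of this instruction */
    uint32_t ct_incl; /* sum of ct_cost for instructions 0..i of the BB */
} InstrCost;

/* One pass at BB finalisation: fill in the running (prefix) sums. */
static void finalise_bb(InstrCost *instrs, size_t n)
{
    uint32_t running = 0;
    for (size_t i = 0; i < n; i++) {
        running += instrs[i].ct_cost;
        instrs[i].ct_incl = running;
    }
}

/* Cost attributed when the BB is left at instruction exit_idx,
 * executed exe_count times: an O(1) read of the precomputed sum. */
static uint64_t cost_at_exit(const InstrCost *instrs, size_t exit_idx,
                             uint64_t exe_count)
{
    return exe_count * (uint64_t)instrs[exit_idx].ct_incl;
}
```

The trade-off is one extra field per instruction in exchange for not re-walking the block on every side exit.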
@not-matthias not-matthias force-pushed the feat/cycle-estimation branch from 1a36a0a to 2bc9a1c Compare June 19, 2026 12:10
2 participants